Aim

The aim of this package is to simplify RNAseq analysis by integrating some pre-existing widely used packages (DESeq21, topGO2, WGCNA3 and rrvgo5, in a tidyverse6 environment) to obtain results and visualization after performing differential expression analysis, PCA, Gene Ontology enrichment and gene co-expression clustering.

The data used as example in this vignette was generated in the work of Liang et al.7 in the liverwort Marchantia polymorpha.

Standard workflow

Quick start

The main function in RNAseqEasy package is DESeq2_simple(), which performs principal component analysis, differential expression analysis and Gene Ontology enrichment analysis from quantified transcript data.

DiffExprAn <- DESeq2_simple(
  output_path = output_Dir_DESeq2, # Path to save results
  sampleDir = sampleDir, # Directory with your Salmon quantified data
  sample_table = sample_table, # Data frame with your sample names and variables
  tx2gene = Marchantia7_tx2gene, # Data frame with your Gene - Transcript relationship
  Variable = c("Variable1", "Variable2"), # Variables from sample_table affecting your analysis
  Design = "Variable1 + Variable2 + Variable1:Variable2", # How to model samples
  Name = "Name_comparison", # Name to include in result files
  Contrast = list(c("ContrastA", "ContrastB")) or ContrastA - ContrastB, 
  geneID2GO = geneID2GO, # GO functional annotation
  orgdb = "org.At.tair.db" # Reference organism from rrvgo database
  )

DESeq2_simple() function general scheme

Input data

The RNAseqEasy package is designed to work with data mapped and quantified by Salmon. The first required input is the path [1] to the directory where quant.sf sample files are saved. The second key input is a data frame [2] that contains the sample names and their corresponding descriptive information (i.e. variables or conditions defining the experiment).

Salmon data

Salmon4 is a widely used software for quantifying transcript abundance. A well detailed tutorial can be found in their official webpage. It is a very fast tool for transcript quantification directly from fastq.gzfiles. The only requirement is the creation of an index of the transcriptome of the organism we are working with (do not use the genome!!!), also explained in their tutorial.

As a result, we will obtain a folder [1] containing one folder per analyzed sample with its names. In each folder, the main output file will be named quant.sf, which contains the name of each transcript and their abundance in Transcripts Per Million (TPM), among other information (length, effective length and number of reads). The path to this output folder will be the first input for our analysis [1].

In this example, we will download the quant.sf files from Liang et al.7 experiment.

# URL from Zenodo
data_url <- "https://zenodo.org/records/15800134/files/GSE275561.zip?download=1"

# Temporal directory to download data
output_dir <- file.path(tempdir(), "GSE275561")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

zip_file <- file.path(tempdir(), "GSE275561.zip")

# Download zip file with Salmon quantified data from Zenodo
download.file(url = data_url, destfile = zip_file, mode = "wb")

# Unzip compressed file
unzip(zip_file, exdir = output_dir)

list.files(output_dir)
## [1] "__MACOSX"              "02_Salmon"             "Sample_Data_Wendy.txt"

So, we set the “02_Salmon” folder as sampleDir, which will direct the analysis to the directory to retrieve quantified data for each sample.

sampleDir <- file.path(output_dir, "02_Salmon")

Sample table

For each analysis, you need to create a data frame [2] (which we’ll call sample_table) that links the name of each sequenced sample with its descriptive variables. You can create it in R or import it from a file. It has to include a Sample column including sample names, and an extra column for each variable of factor that were included in the experiment design. In our example, we include 12 different samples (Wendy1 to Wendy12), and there were 3 different variables describing them:

  • Genotype. Samples were divided in two different genotypes: ‘Tak1’ (wild-type organisms) and ‘gh3a’ (CRISPR-Cas9 mutants).
  • Treatment. Two different treatments were applied to samples: ‘Mock’ or ‘OPDA’ (exposure to a high concentration of a plant stress hormone from jasmonic acid family).
  • Replicate. Each category of samples included 3 independent biological replicates.
# Data frame with 'Sample', 'Genotype', 'Treatment and 'Replicate' information

sample_table <- data.frame(
  Sample = paste0("Wendy", seq(1,12)),
  Genotype = rep(c("Tak1", "gh3a"), each = 6),
  Treatment = rep(c("Mock", "OPDA"), each = 3),
  Replicate = seq(1,3)
)

sample_table
##     Sample Genotype Treatment Replicate
## 1   Wendy1     Tak1      Mock         1
## 2   Wendy2     Tak1      Mock         2
## 3   Wendy3     Tak1      Mock         3
## 4   Wendy4     Tak1      OPDA         1
## 5   Wendy5     Tak1      OPDA         2
## 6   Wendy6     Tak1      OPDA         3
## 7   Wendy7     gh3a      Mock         1
## 8   Wendy8     gh3a      Mock         2
## 9   Wendy9     gh3a      Mock         3
## 10 Wendy10     gh3a      OPDA         1
## 11 Wendy11     gh3a      OPDA         2
## 12 Wendy12     gh3a      OPDA         3

In differential expression analysis, comparisons are performed against a reference level. By default, R sets levels based on alphabetical order for each variable. While you can specify the desired comparison in each analysis, it is good practice to set the factor level order beforehand. We transform each variable column into factor and establish the level order. In this example, “Tak-1” is the reference genotype, and “Mock” is the reference treatment.

# Convert 'Genotype' into factor, setting "Tak1" as reference
sample_table$Genotype <- factor(sample_table$Genotype,
                                levels = c("Tak1", "gh3a"))

# Convert 'Treatment' into factor, setting "Mock" as reference
sample_table$Treatment <- factor(sample_table$Treatment, levels = c("Mock", "OPDA"))

str(sample_table)
## 'data.frame':    12 obs. of  4 variables:
##  $ Sample   : chr  "Wendy1" "Wendy2" "Wendy3" "Wendy4" ...
##  $ Genotype : Factor w/ 2 levels "Tak1","gh3a": 1 1 1 1 1 1 2 2 2 2 ...
##  $ Treatment: Factor w/ 2 levels "Mock","OPDA": 1 1 1 2 2 2 1 1 1 2 ...
##  $ Replicate: int  1 2 3 1 2 3 1 2 3 1 ...

Functional annotation

The remaining data needed is specific to your organism of interest, not the experiment itself. There are two different files that we will need to import for the analysis:

  • A data frame that correlates each transcript name with its corresponding gene name.
  • A file including functional annotation, i.e., correlating each gene/transcript to GO terms.

In this example, we are working with samples extracted from the liverwort Marchantia polymorpha, a model plant widely use to study how ancient or conserved biological processes are in plant evolution. To get the reference transcriptome, we download it from marchantia.info, the Genome Database for Marchantia polymorpha. The get.fasta.name() function from phylotools8 package allows us to obtain the transcript names from the fasta file. In Marchantia, gene names and transcript share the same nomenclature, only adding a dot and a number to the different transcripts of a gene. To correlate them, we create a data frame including a column called Name containing transcript names, and based on that we create a new column called GENEID removing last 2 characters (using str_sub() function from string package).

library(phylotools)
library(tidyverse)
URL <- gzcon(url(paste("https://marchantia.info/download/MpTak_v6.1/", "MpTak_v6.1r1.mrna.fasta.gz", sep="")))
txt <- readLines(URL)
Marchantia7_Transcripts <- get.fasta.name(textConnection(txt), clean_name = FALSE)
head(Marchantia7_Transcripts)
## [1] "Mp1g00070.1" "Mp1g00080.1" "Mp1g00090.1" "Mp1g00090.2" "Mp1g00090.3"
## [6] "Mp1g00090.4"
Marchantia7_tx2gene <- data.frame(Name = Marchantia7_Transcripts) %>%
  tidyr::separate(Name, sep = " ", into = c("TXNAME", "CDS")) %>%
  dplyr::mutate(GENEID = stringr::str_sub(TXNAME, 1, -3)) %>%
  dplyr::select(-CDS)

head(Marchantia7_tx2gene, 10)
##         TXNAME    GENEID
## 1  Mp1g00070.1 Mp1g00070
## 2  Mp1g00080.1 Mp1g00080
## 3  Mp1g00090.1 Mp1g00090
## 4  Mp1g00090.2 Mp1g00090
## 5  Mp1g00090.3 Mp1g00090
## 6  Mp1g00090.4 Mp1g00090
## 7  Mp1g00090.5 Mp1g00090
## 8  Mp1g00090.6 Mp1g00090
## 9  Mp1g00100.1 Mp1g00100
## 10 Mp1g00110.1 Mp1g00110

For this example, functional annotation of Gene Ontologies (GOs) for Marchantia genes can be downloaded from the package in a csv file. Many organisms reference databases include this kind of files. If you cannot find them, try to find tools and tutorials to annotate your reference genome.

library(RNAseqEasy)

GO_data <- system.file("extdata",
                      "Mpo6.1_GO_db_GeneGO.csv",
                      package="RNAseqEasy", mustWork=TRUE)
Mpo_GO_GOSLIM <- read.csv(GO_data, sep = "\t")
head(Mpo_GO_GOSLIM)
##   Transcript         GO
## 1  Mp1g00060 GO:0003676
## 2  Mp1g00060 GO:0032774
## 3  Mp1g00060 GO:0034062
## 4  Mp1g00060 GO:0042644
## 5  Mp1g00060 GO:0046914
## 6  Mp1g00070 GO:0000075

This annotation file is a two column data frame correlating transcript names and GO terms (one assotiation per row). For topGO enrichment analysis, we will need to convert this data frame into a list where each element is a transcript ID (the key) and its content is a vector of associated GO terms (the values). The load_topGO_db() function included in this package helps with this purpose.

geneID2GO <- load_topGO_db(Mpo_GO_GOSLIM, "Transcript", "GO")
head(geneID2GO)
## $Mp1g00060
## [1] "GO:0003676" "GO:0032774" "GO:0034062" "GO:0042644" "GO:0046914"
## 
## $Mp1g00070
##  [1] "GO:0000075" "GO:0000123" "GO:0000725" "GO:0000790" "GO:0000800"
##  [6] "GO:0000976" "GO:0001894" "GO:0003682" "GO:0003700" "GO:0004842"
## [11] "GO:0005704" "GO:0005759" "GO:0005813" "GO:0005829" "GO:0005886"
## [16] "GO:0006302" "GO:0007623" "GO:0008023" "GO:0008157" "GO:0008270"
## [21] "GO:0009653" "GO:0010074" "GO:0010332" "GO:0016458" "GO:0016607"
## [26] "GO:0031057" "GO:0031062" "GO:0031436" "GO:0032409" "GO:0032786"
## [31] "GO:0034243" "GO:0035065" "GO:0036464" "GO:0040029" "GO:0042127"
## [36] "GO:0042800" "GO:0042803" "GO:0042981" "GO:0043970" "GO:0044030"
## [41] "GO:0044093" "GO:0044648" "GO:0044666" "GO:0045322" "GO:0045944"
## [46] "GO:0048872" "GO:0051050" "GO:0051053" "GO:0051569" "GO:0065003"
## [51] "GO:0070531" "GO:0070577" "GO:0071339" "GO:0071479" "GO:0080182"
## [56] "GO:0099402" "GO:1902750" "GO:1903507" "GO:1990904" "GO:2000113"
## [61] "GO:2001025" "GO:2001038"
## 
## $Mp1g00080
## [1] "GO:0005794" "GO:0009570" "GO:0009941"
## 
## $Mp1g00090
## [1] "GO:0005773" "GO:0005829" "GO:0012505" "GO:0031410" "GO:0034272"
## [6] "GO:0098588" "GO:0098805"
## 
## $Mp1g00100
##  [1] "GO:0000137" "GO:0000138" "GO:0005634" "GO:0005768" "GO:0005797"
##  [6] "GO:0005802" "GO:0007155" "GO:0008194" "GO:0009507" "GO:0009832"
## [11] "GO:0010214" "GO:0010395" "GO:0031224" "GO:0045489" "GO:0070592"
## [16] "GO:0080157"
## 
## $Mp1g00110
##  [1] "GO:0000491" "GO:0001094" "GO:0005634" "GO:0005737" "GO:0042802"
##  [6] "GO:0051117" "GO:0070062" "GO:0070761" "GO:0097159" "GO:1901363"

To perform GO semantic clustering, the analysis is based on rrvgo5 and GOSemSim9 packages. The available species for this analysis are included in Bioconductor webpage. The only plant organism included is Arabidopsis thaliana, so we will use its code (org.At.tair.db) for this analysis. We can save this data in an object, so it will save computing time in the following analysis. There are three classes of GO terms: Biological Processes (BP), Molecular Functions (MF) and Cell Components (CC). In this example, we will only be focused on biological processes.

save <- GOSemSim::godata("org.At.tair.db", ont="BP")

Differential gene expression analysis

At this point, we already have everything we need to perform our RNA-seq analysis: a directory with our quantified transcript data (sampleDir), a data frame with the experiment metadata (sample_table), a data frame correlating gene names and transcript names for our organism (Marchantia7_tx2gene), an annotation list correlating transcript names and their annotated GO terms (geneID2GO) and the model organism reference data for the semantic clustering (save).

We want to obtain differential expressed genes (DEGs) in Tak-1 genotype in OPDA treatment compared to Tak-1 genotype in Mock conditions. To organize our results, we will create a folder to save all differential expression analyses and we will call it “DESeq2”, and inside we will create another folder to save specifically this comparison, and we will call it “Tak1_OPDA”.

dir.create(file.path(output_dir, "DESeq2"))
dir.create(file.path(output_dir, "DESeq2", "Tak1_OPDA"))
output_Dir_DESeq2 <- file.path(output_dir, "DESeq2", "Tak1_OPDA")

To run the analysis, we should install DESeq2 and topGO packages from Bioconductor and load them in our R session.

Now, we are ready to run our differential expression analysis. For that, we will use DESeq2_simple function. We set all required parameters that we previously defined (output_path, sampleDir, sample_table, tx2gene, geneID2GO, orgdb and semdata). But, still, there are some parameters that we have to define for this analysis:

  • Name: We have to include a string to name all files that are going to be generated. In this example, we use the same as the folder where they will be saved (“Tak1_OPDA”).
  • Variable: Vector of variables or factors that are affecting gene expression in the experiment.
  • Design: Design formula indicating the variables which will be used in the modelling, i.e., which variables we expect that are having an effect in gene expression in the analysis. In this example, we expect that “Genotype” itself (there will be genes which expression will be expected to be affected only by genotype in any condition), “Treatment” itself (there will be genes which expression will be expected to be affected only by treatment in any genotype), and the interaction “Genotype+Treatment” (there will be genes which expression will be expected to be affected differently by treatment in a genotype compared to the other), will be affecting gene expression. Therefore, our design formula will be “Genotype + Treatment + Genotype:Treatment”. (For more information about design, read the DESeq2 vignette).
  • Contrast: The specific contrast we want to perform to test differential expression. If you do not specify a contrast, the function will stop and print all possible contrasts that can be used with the current experimental design. There are two different ways to define the contrast we want to:
    • Condition A - Condition B.
    • list(c(“Contrast”)).

For more information about contrasts, read the DESeq2 vignette and check this tutorial.

library(DESeq2)
library(topGO)

DESeq2_simple(
    output_path = output_Dir_DESeq2,
    sampleDir = sampleDir,
    sample_table = sample_table,
    tx2gene = Marchantia7_tx2gene,
    geneID2GO = geneID2GO,
    orgdb = "org.At.tair.db",
    semdata = save,
    Name = "Tak1_OPDA",
    Variable = c("Genotype", "Treatment"),
    Design = "Genotype + Treatment + Genotype:Treatment"
  )
## [1] These are the DESeq2 contrasts, that you have to use like list(c("ContrastA", "ContrastB")) : 
## [1] "Intercept"                  "Genotype_gh3a_vs_Tak1"     
## [3] "Treatment_OPDA_vs_Mock"     "Genotypegh3a.TreatmentOPDA"
## [1] These are the costumized contrasts, that you have to use like ContrastA - ContrastB : 
## [1] "Tak1"      "gh3a"      "Mock"      "OPDA"      "Tak1_Mock" "Tak1_OPDA"
## [7] "gh3a_Mock" "gh3a_OPDA"
## Error in DESeq2_simple(output_path = output_Dir_DESeq2, sampleDir = sampleDir, : Choose one of the previous contrasts

After running DESeq2_simple with the Contrast parameter omitted, an error message appears, prompting us to choose one of the contrasts printed to the console. In this example, we want to compare Tak-1 genotype in OPDA treatment samples vs. Tak-1 genotype in Mock conditions. There are two different ways we can set this contrast:

  • Tak1_OPDA - Tak1_Mock.
  • list(c(“Treatment_OPDA_vs_Mock”)). Since we established Tak-1 as our reference genotype (first factor), the treatment OPDA vs Mock contrast will get this comparison in the reference genotype, i.e., Tak-1.
Tak1_OPDA_result <- DESeq2_simple(
    output_path = output_Dir_DESeq2,
    sampleDir = sampleDir,
    sample_table = sample_table,
    Include = NULL,
    Exclude = NULL,
    tx2gene = Marchantia7_tx2gene,
    Variable = c("Genotype", "Treatment"),
    Design = "Genotype + Treatment + Genotype:Treatment",
    Group = "NO",
    Name = "Tak1_OPDA",
    Contrast = Tak1_OPDA - Tak1_Mock,
    Reduced = FALSE,
    log2FCtopGO = 1,
    geneID2GO = geneID2GO,
    ontology = "BP",
    plot_similarity = TRUE,
    orgdb = "org.At.tair.db",
    semdata = save
  )

If we want to perform other possible analyses, these would be the contrast we would have to use:

  • OPDA vs Mock in gh3a genotype: gh3a_OPDA - gh3a_Mock, or list(c(“Treatment_OPDA_vs_Mock”, “Genotypegh3a.TreatmentOPDA”)).

  • gh3a vs Tak-1 in Mock conditions: gh3a_Mock - Tak1_Mock, or list(c(“Genotype_gh3a_vs_Tak1”)).

  • gh3a vs Tak-1 in OPDA conditions: gh3a_OPDA - Tak1_OPDA, or list(c(“Genotype_gh3a_vs_Tak1”, “Genotypegh3a.TreatmentOPDA”)).

  • Genes affected by the Genotype:Treatment interaction: (gh3a_OPDA - gh3a_Mock) - (Tak1_OPDA - Tak1_Mock), or list(c(“Genotypegh3a.TreatmentOPDA”)). This are genes that will be responding to OPDA treatment in the gh3a genotype in a different way that the expected from the response in Tak-1 genotype.

dir.create(file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction"))
output_Dir_interaction <- file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction")
Genotype_Treatment_result <- DESeq2_simple(
    output_path = output_Dir_interaction,
    sampleDir = sampleDir,
    sample_table = sample_table,
    Include = NULL,
    Exclude = NULL,
    tx2gene = Marchantia7_tx2gene,
    Variable = c("Genotype", "Treatment"),
    Design = "Genotype + Treatment + Genotype:Treatment",
    Group = "NO",
    Name = "Genotype_Treatment_interaction",
    Contrast = list(c("Genotypegh3a.TreatmentOPDA")),
    Reduced = FALSE,
    log2FCtopGO = 1,
    geneID2GO = geneID2GO,
    ontology = "BP",
    plot_similarity = TRUE,
    orgdb = "org.At.tair.db",
    semdata = save
  )

Sample selection

By default, all samples in sampleDir will be included for the differential expression analysis using DESeq2_simple function. However, it could be interesting to perform an analysis including only a subset of these samples. For example, let’s think about the scenario where we want to study DEGs by OPDA in Tak-1 genotype ignoring the existence of gh3a in the experiment. Thus, we would only want to include samples belonging to Tak-1 genotype. There are two possible ways of doing that in DESeq2_simple function: by using Include or Exclude parameters.

  • Include: we define features in a vector that all samples must present to be included in the analysis. In this example, it would be Include = c("Tak1").
  • Exclude: we define features in a vector that we want to remove from the analysis. All samples containing any of these features will be removed. In this example, it would be Exclude = c("gh3a").

In our example, both parameters would work the same way, i.e., only including Tak-1 samples in the analysis, both in OPDA and Mock treatments.

PCA

Apart from differential expression analysis, all samples included in the DESeq2_simple function will be compute for principal component analysis and we will obtain a png file of PC1 vs PC2 using 500 top genes by default. If we do not want to generate the PCA, we can set the PCA parameter to FALSE within the DESeq2_simple function. If we want to perform a different principal component analysis (plotting other component, using a different number of top genes, changing colors, etc…) we can do it by running the run_pca_analysis function.

Output

After running DESeq2_simple function, there are some files generated with results from the differential gene expression analysis:

  • Two tab-delimited text files are generated including results from the differential expression analysis of all genes (Name.txt) and only significantly expressed genes, i.e., \(padj \le 0.05\) (Name_Sig.txt).
##            baseMean log2FoldChange     lfcSE       stat       pvalue
## Mp1g00070  147.7289    -0.87073377 0.2215426 -3.9303223 8.483207e-05
## Mp1g00080 8077.3196    -0.54069976 0.1329801 -4.0660199 4.782285e-05
## Mp1g00090  531.2465    -0.05551592 0.1356234 -0.4093389 6.822909e-01
## Mp1g00100 1127.5365     0.87159546 0.1302033  6.6941140 2.169823e-11
## Mp1g00110  419.9734    -0.10280589 0.1763512 -0.5829610 5.599196e-01
## Mp1g00120  511.5859    -0.35720864 0.1180411 -3.0261372 2.476999e-03
##                   padj
## Mp1g00070 3.932574e-04
## Mp1g00080 2.335654e-04
## Mp1g00090 7.772935e-01
## Mp1g00100 2.389950e-10
## Mp1g00110 6.791640e-01
## Mp1g00120 7.944981e-03
  • A PDF file including a heatmap of \(log2(n + 1)\) normalized values of all samples included in the analysis (Name_heatmap.pdf).
Heatmap of normalized values of DEGs in the OPDA vs Mock comparison for Tak-1 genotype

Heatmap of normalized values of DEGs in the OPDA vs Mock comparison for Tak-1 genotype

  • A PNG file containing the PCA plot for all samples included in the analysis. This is generated only if PCA parameter is TRUE (which is the default) (PCA_Name.png).
PCA plot of PC1 vs PC2, using 500 top genes

PCA plot of PC1 vs PC2, using 500 top genes

  • A “topGO” folder, including results of GO enrichment analysis for Up, Down and Up+Down DEGs (\(padj \le 0.05\), and \(log2(FC) \ge 1\) for Up-regulated and \(log2(FC) \le -1\) for Down-regulated genes):
    • Two tab-delimited text files including results from the GO enrichment analysis. One include detailed results (topGO_Name_All/Up/Down_results.txt) and the other only enriched GO terms and their p-values (topGO_Name_All/Up/Down_pvals.txt).
##            Annotated Significant Expected
## GO:0044238      6226         509   528.04
## GO:0045893       688          59    58.35
## GO:0048574        90           9     7.63
## GO:0010154      1454         150   123.32
## GO:0044706       462          48    39.18
## GO:0044281      1878         198   159.28
##                                                          Term         pval
## GO:0044238                          primary metabolic process 0.0061337475
## GO:0045893 positive regulation of DNA-templated transcription 0.0060918982
## GO:0048574                 long-day photoperiodism, flowering 0.0049341244
## GO:0010154                                  fruit development 0.0002647838
## GO:0044706               multi-multicellular organism process 0.0321583675
## GO:0044281                   small molecule metabolic process 0.0091339926
##            Enrichment
## GO:0044238   1.006982
## GO:0045893   1.056274
## GO:0048574   1.231723
## GO:0010154   1.270691
## GO:0044706   1.279712
## GO:0044281   1.298621
  • Three PDF files for GO enrichment visualization:
    • A tree map plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Treemap.pdf).

    • A scatter plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Scatterplot.pdf).

    • A bubble plot of all enriched GO terms in descendant order according to their enrichment (topGO_Name_All/Up/Down_bubble.pdf).

Tree map of Gene Ontologies enriched in Up-regulated genes

Tree map of Gene Ontologies enriched in Up-regulated genes

Semantic cluster of Gene Ontologies enriched in Up-regulated genes

Semantic cluster of Gene Ontologies enriched in Up-regulated genes

Bubble plot of Gene Ontologies enriched in Up-regulated genes

Bubble plot of Gene Ontologies enriched in Up-regulated genes

Clustering of co-expressed genes

WGCNA_Module() function general scheme

Another application of the package is the generation of co-expressed gene modules by the implementation of components from the WGCNA3 package. This functionality is included in the WGCNA_Modules function. As DESeq2_simple function, it requires to specify the output_path, sampleDir, sample_table, Variables, tx2gene, Name, geneID2GO and Annotation parameters. The specific parameters that have also to be included in this function are:

  • DEGs: A vector of Gene IDs. This will be the input set of genes that we want to cluster. The Gene IDs must correspond to the transcriptome that we are using in the analysis.
  • Power: Soft thresholding power used in the network construction (check the WGNA manual or tutorial for more info). If you run the WGCNA_Modules function without specifying the Power parameter, it will stop and generate a “softthresholdingpower.pdf” file to help you select an appropriate power value. As an alternative, this table suggests the power to select depending on the number of samples and type of network: Suggested Soft thresholding powers

Most of the advanced parameters in for the blockwiseModules function of the WGCNA package for the module generation are set to default.

As an example, we will use the output DEGs from the Genotype:Treatment interaction in DESeq2_simple, i.e., genes that significantly respond to the OPDA treatment in the gh3a genotype in a different way than the Tak-1 genotype. This different response can be in different ways, so we will use the gene co-expression modules to explore these responses.

DESeq2_simple_output <- read.table(file.path(output_Dir_interaction, "Genotype_Treatment_interaction.txt"), 
                                   header = TRUE)

## We get rownames (GeneIDs) of the output data frame from DESeq2_simple(),
## applying a filter of log2FC > 1 for Up-regulated DEGs and log2FC < -1 for
## Down-regulation.

DEGs <- rownames(DESeq2_simple_output %>% filter(log2FoldChange > 1 | log2FoldChange < -1))

Now that we already have selected our input gene set, we can run the WGCNA_Modules function:

library(WGCNA)
dir.create(file.path(output_dir, "WGCNA"))
output_WGCNA <- file.path(output_dir, "WGCNA")
COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
                                   sampleDir = sampleDir,
                                   sample_table = sample_table,
                                   DEGs = DEGs,
                                   Variables = c("Genotype", "Treatment"),
                                   tx2gene = Marchantia7_tx2gene,
                                   Name = "Genotype_Treatment_Interaction",
                                   NumberCol = 1,
                                   geneID2GO = geneID2GO,
                                   semdata = save)
##  Flagging genes and samples with too many missing values...
##   ..step 1
## pickSoftThreshold: will use block size 886.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 886 of 886
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1    0.263  0.783          0.143  331.00   327.000  483.0
## 2      2    0.403 -0.390          0.352  179.00   158.000  345.0
## 3      3    0.762 -0.636          0.874  115.00    86.100  274.0
## 4      4    0.825 -0.726          0.933   81.60    50.200  230.0
## 5      5    0.857 -0.769          0.957   61.60    31.700  199.0
## 6      6    0.886 -0.799          0.980   48.60    21.000  175.0
## 7      7    0.883 -0.823          0.945   39.60    15.400  158.0
## 8      8    0.908 -0.837          0.958   33.10    11.700  143.0
## 9      9    0.919 -0.840          0.963   28.20     8.820  131.0
## 10    10    0.930 -0.856          0.966   24.40     6.890  121.0
## 11    12    0.935 -0.865          0.970   18.90     4.310  105.0
## 12    14    0.944 -0.874          0.969   15.10     2.870   92.3
## 13    16    0.956 -0.880          0.979   12.50     1.980   82.4
## 14    18    0.955 -0.878          0.973   10.50     1.430   74.4
## 15    20    0.952 -0.898          0.969    8.96     1.090   67.6
## 16    22    0.962 -0.905          0.970    7.76     0.806   62.0
## 17    24    0.968 -0.905          0.970    6.79     0.597   57.1
## 18    26    0.975 -0.900          0.979    6.00     0.444   53.1
## 19    28    0.973 -0.905          0.970    5.35     0.360   49.5
## 20    30    0.976 -0.919          0.973    4.80     0.278   46.3
## 21    32    0.977 -0.930          0.972    4.33     0.218   43.5
## 22    34    0.982 -0.939          0.979    3.93     0.174   41.0
## 23    36    0.977 -0.931          0.973    3.58     0.137   38.7
## 24    38    0.985 -0.934          0.982    3.28     0.109   36.6
## 25    40    0.974 -0.936          0.968    3.02     0.086   34.7
## Error in WGCNA_Modules(output_path = output_WGCNA, sampleDir = sampleDir, : Choose a power based on softthresholdingpower.pdf

We did not include the Power parameter, so the function stopped and warned us to select one based on the “softthresholdingpower.pdf” file.

R^2 values as a function of the soft thresholds

R^2 values as a function of the soft thresholds

Based on this plot, we will select 7 as the soft thresholding power since it is the first value that rises the threshold.

COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
                                   sampleDir = sampleDir,
                                   sample_table = sample_table,
                                   DEGs = DEGs,
                                   Variables = c("Genotype", "Treatment"),
                                   tx2gene = Marchantia7_tx2gene,
                                   Name = "Genotype_Treatment_Interaction",
                                   Power = 7,
                                   NumberCol = 1,
                                   geneID2GO = geneID2GO,
                                   semdata = save)
##  Flagging genes and samples with too many missing values...
##   ..step 1
## pickSoftThreshold: will use block size 886.
##  pickSoftThreshold: calculating connectivity for given powers...
##    ..working on genes 1 through 886 of 886
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
## 1      1    0.263  0.783          0.143  331.00   327.000  483.0
## 2      2    0.403 -0.390          0.352  179.00   158.000  345.0
## 3      3    0.762 -0.636          0.874  115.00    86.100  274.0
## 4      4    0.825 -0.726          0.933   81.60    50.200  230.0
## 5      5    0.857 -0.769          0.957   61.60    31.700  199.0
## 6      6    0.886 -0.799          0.980   48.60    21.000  175.0
## 7      7    0.883 -0.823          0.945   39.60    15.400  158.0
## 8      8    0.908 -0.837          0.958   33.10    11.700  143.0
## 9      9    0.919 -0.840          0.963   28.20     8.820  131.0
## 10    10    0.930 -0.856          0.966   24.40     6.890  121.0
## 11    12    0.935 -0.865          0.970   18.90     4.310  105.0
## 12    14    0.944 -0.874          0.969   15.10     2.870   92.3
## 13    16    0.956 -0.880          0.979   12.50     1.980   82.4
## 14    18    0.955 -0.878          0.973   10.50     1.430   74.4
## 15    20    0.952 -0.898          0.969    8.96     1.090   67.6
## 16    22    0.962 -0.905          0.970    7.76     0.806   62.0
## 17    24    0.968 -0.905          0.970    6.79     0.597   57.1
## 18    26    0.975 -0.900          0.979    6.00     0.444   53.1
## 19    28    0.973 -0.905          0.970    5.35     0.360   49.5
## 20    30    0.976 -0.919          0.973    4.80     0.278   46.3
## 21    32    0.977 -0.930          0.972    4.33     0.218   43.5
## 22    34    0.982 -0.939          0.979    3.93     0.174   41.0
## 23    36    0.977 -0.931          0.973    3.58     0.137   38.7
## 24    38    0.985 -0.934          0.982    3.28     0.109   36.6
## 25    40    0.974 -0.936          0.968    3.02     0.086   34.7
##  Calculating module eigengenes block-wise from all genes
##    Flagging genes and samples with too many missing values...
##     ..step 1
##  ..Working on block 1 .
##     TOM calculation: adjacency..
##     ..will not use multithreading.
##      Fraction of slow calculations: 0.000000
##     ..connectivity..
##     ..matrix multiplication (system BLAS)..
##     ..normalization..
##     ..done.
##    ..saving TOM for block 1 into file RNAseqEasy-block.1.RData
##  ....clustering..
##  ....detecting modules..
##  ....calculating module eigengenes..
##  ....checking kME in modules..
##      ..removing 8 genes from module 1 because their KME is too low.
##      ..removing 2 genes from module 2 because their KME is too low.
##      ..removing 5 genes from module 3 because their KME is too low.
##      ..removing 1 genes from module 4 because their KME is too low.
##  ..merging modules that are too close..
##      mergeCloseModules: Merging modules whose distance is less than 0.25
##        Calculating new MEs...

There are 6 different clusters or modules that have been generated after the analysis. The average behavior of genes from each category are summarized in the “Summary_Modules_Name.pdf” file:

Summary of gene expression behavior in each co-expression module of genes responding to the Genotype:Treatment interaction

Summary of gene expression behavior in each co-expression module of genes responding to the Genotype:Treatment interaction

Output

  • A tab-delimited TXT file for each module, containing genes included in that module and its TPM expression in each sample of the analysis (Module_genes.txt).

  • A CSV file of two columns, including all genes used as input (DEGs parameter) and the module name where they were included (Name_geneInfo.csv).

  • Three PDF files:

    • Module-trait relationships heatmap of each module and each variable (moduleTraitRelationships.pdf).
    • A sample dendrogram to see how samples cluster in the analysis (sampleClustering.pdf).
    • A soft thresholding powers plot to help us to select the Power parameter (softthresholdingpower.pdf)
    • The summary plot showing the average behavior of genes included in each module (Summary_Modules_Name.pdf). We can modify the number of columns to organize modules in this figure by changing the NumberCol parameter, and the colors of the lines specifying a named vector in the Colors_plot parameter (colors as values, variables as names).
  • A “topGO” folder, including same output files for each module as DESeq2_simple() function.

Gene Ontology Enrichment

topGO_All() function general scheme

Gene Ontology Enrichment analysis in this package is based on topGO2 package, and the function that performs it is topGO_All(). This function is also integrated in the DESeq2_simple() and WGCNA_Module() functions. It requires a vector or data frame with transcript IDs (if it is a data frame, it should follow the DESeq2_simple() output format, where Gene IDs are row names). It also requires the Name, Annotation and orgdb parameters.

  • The type of ontologies to perform GO enrichment is set to Biological Processes (“BP”) as default, but it can be changed to Molecular Functions (“MF”) or Cellular Component (“CC”) in the ontology parameter.
  • Default algorithm is “weight01”, but it can be changed to “classic”, “elim”, “weight”, “lea” or “parentchild” (read topGO vignette for more information).
  • Default statistic is “fisher”, but it can be changed to “ks”, “t”, “globaltest” or “sum” (read topGO vignette for more information).
  • If we want to explore which genes are represented by specific GO terms of the analysis, we can set save_GeneNames parameter to TRUE and include a character vector with the GO IDs in the Ontologies parameter. It will generate a XLSX file with a spreadsheet for each ontology. Be sure that every GO ID is represented as an output of the topGO_All() function.

Bibliography

1.
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. 15, 550 (2014).
2.
Alexa, A. & Rahnenfuhrer, J. topGO: Enrichment analysis for gene ontology. (2021).
3.
Langfelder, P. & Horvath, S. WGCNA: An r package for weighted correlation network analysis. 559 (2008).
4.
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods 14, 417–419 (2017).
5.
Sayols, S. Rrvgo: A bioconductor package for interpreting lists of gene ontology terms. (2023) doi:10.17912/MICROPUB.BIOLOGY.000811.
6.
Wickham, H. et al. Welcome to the tidyverse. Journal of Open Source Software 4, 1686 (2019).
7.
8.
9.
Yu, G. Gene Ontology Semantic Similarity Analysis Using GOSemSim. in 207–215 (Springer US, 2020). doi:10.1007/978-1-0716-0301-7_11.